Summary of work completed

Wrangling and data cleaning

  • Merged metadata sources from all field collections
  • Miscellaneous data cleaning
  • Spatial clustering of points to assign “populations”
  • Merged clustered metadata with genotype data

COLONY/pedigree analysis

  • Prepared COLONY input data (see assumptions below)
  • Ran COLONY for various datasets
  • Wrangled and examined COLONY outputs
    • ⚠️ There are issues!! ⚠️

Preliminary genetic analysis

  • Various quality checks and summary statistics
  • “Toy” versions of popgen analysis like He/Ho and F-stats
    • ⚠️ There are issues!! ⚠️

Outstanding issues and points of confusion


Next Steps


Preliminary “Results”

Data Availability and Map of collections

  • Total specimens in USDA database: 665
  • Total specimens with ANY genetic data in USDA database: 498
  • Specimens after basic filtering: 465
    • Had known lat-long, sex, and other basic metadata
    • Was from 2020 or 2021
    • Other minor filtering
  • Specimens used in COLONY: 307
    • Females only
    • Had at least 10 loci with data (vast majority have 13)
    • Was NOT from a known colony (this removes quite a few specimens)
  • Specimens in pop-gen analyses: 307
    • Same filters as COLONY
    • ⚠️ Siblings have not yet been excluded, both because the COLONY output is inconsistent and because it’s debatable whether excluding them is a wise step anyway

Figure 1. Map showing the collection locations of the 465 specimens available after basic initial filtering. Colors represent the “population cluster” assigned by the tree method (tree length = 10 km). Hover over the dots to see which cluster they were assigned to!
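To make the clustering step concrete, here is a minimal sketch of one way a 10 km tree cutoff could work. It assumes the “tree method” means single-linkage clustering (any two points within the cutoff end up in the same cluster, including through chains of neighbors); the linkage actually used may differ. The function names and the coordinates are hypothetical, and the sketch is in plain Python rather than the tooling used for this report.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def cluster_points(points, cutoff_km=10.0):
    """Single-linkage clusters (an assumption): link any pair within cutoff_km."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= cutoff_km:
                parent[find(i)] = find(j)

    # Relabel cluster roots as 1, 2, 3, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(len(points))]

# Hypothetical coordinates: three points spaced ~5.6 km apart, plus one far away.
points = [(40.00, -105.00), (40.05, -105.00), (40.10, -105.00), (41.00, -105.00)]
labels = cluster_points(points, cutoff_km=10.0)
print(labels)
```

Note that the first and third points are more than 10 km apart but still share a cluster because the middle point links them. That chaining is a property of single linkage and is worth keeping in mind when deciding how clusters should be defined.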

COLONY results and issues

This table shows several problems with the COLONY output, such as colonies whose members span multiple clusters (so really long distances apart) or multiple years (so they cannot be siblings).

This table similarly shows how many colonies are affected by each issue (singletons excluded).
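The check behind these tables can be sketched roughly as follows: group specimens by their assigned colony, drop singletons, and flag any colony whose members come from more than one cluster or more than one year. This is only an illustration in Python; the tuple layout, the function name, and the example records are hypothetical and not taken from the actual COLONY output.

```python
from collections import defaultdict

def flag_colonies(records):
    """records: iterable of (colony_id, cluster, year) tuples, one per specimen."""
    members = defaultdict(list)
    for colony, cluster, year in records:
        members[colony].append((cluster, year))

    flags = {}
    for colony, rows in members.items():
        if len(rows) < 2:
            continue  # singletons cannot conflict with anything, so skip them
        clusters = {c for c, _ in rows}
        years = {y for _, y in rows}
        flags[colony] = {
            "n": len(rows),
            "multi_cluster": len(clusters) > 1,  # members far apart in space
            "multi_year": len(years) > 1,        # members from different years
        }
    return flags

# Hypothetical assignments, made up for illustration only.
records = [
    ("C1", "A", 2020), ("C1", "A", 2020),  # consistent colony
    ("C2", "A", 2020), ("C2", "B", 2020),  # spans two clusters
    ("C3", "A", 2020), ("C3", "A", 2021),  # spans two years
    ("C4", "B", 2021),                     # singleton, excluded
]
flags = flag_colonies(records)
n_problem = sum(f["multi_cluster"] or f["multi_year"] for f in flags.values())
print(flags)
print(n_problem, "of", len(flags), "non-singleton colonies have an issue")
```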

Preliminary genetic analysis

Assumptions

Because the output of COLONY is so suspect at this time, I wanted to proceed without excluding siblings. There’s debate as to whether or not this is “valid”. The basic idea is that you exclude siblings because they represent a biased subset of genetic variability…but if you’re sampling randomly and there are a lot of siblings in the data, then it’s possibly because there’s not a lot of variability in the population, so you should expect to encounter (and include) siblings a lot.

In any event, the following analysis uses the same specimens as the COLONY run: females, from unknown colonies, with data for at least 10 loci.

For now, this is maybe more for the “sake of argument” until we chase down the underlying issues with COLONY…or maybe it will help reveal why COLONY is struggling…

Quality checks and summary statistics

Missing loci

There are no loci with <80% completeness.

## named numeric(0)

Individuals with poor data

There are no individuals with poor data in the dataset…but that is because I also excluded them earlier in the process. So this is just a sanity check.

## named numeric(0)
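For reference, the two completeness checks in this section (loci below 80% completeness, and individuals with too few typed loci) amount to something like the sketch below. It is written in Python purely for illustration; the data layout (a dict of specimen ID to a list of per-locus genotypes, with `None` for missing) and all function names are assumptions, and the toy data are made up.

```python
def locus_completeness(genotypes):
    """Fraction of specimens with a called genotype at each locus."""
    rows = list(genotypes.values())
    n_loci = len(rows[0])
    return [sum(r[k] is not None for r in rows) / len(rows) for k in range(n_loci)]

def loci_below(genotypes, min_completeness=0.8):
    """Locus index -> completeness, for loci under the threshold."""
    return {k: c for k, c in enumerate(locus_completeness(genotypes))
            if c < min_completeness}

def individuals_below(genotypes, min_loci=10):
    """Specimen IDs with fewer than min_loci typed loci."""
    return [sid for sid, row in genotypes.items()
            if sum(g is not None for g in row) < min_loci]

# Toy data: 3 hypothetical specimens x 3 loci, None marks a missing genotype.
genotypes = {
    "S1": [(100, 104), (200, 200), None],
    "S2": [(100, 100), None, None],
    "S3": [(104, 104), (200, 204), (300, 300)],
}
low_loci = loci_below(genotypes, min_completeness=0.8)
poor_inds = individuals_below(genotypes, min_loci=2)  # small cutoff for the toy data
print(low_loci)
print(poor_inds)
```

An empty result from both checks corresponds to the empty outputs shown above.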

Duplicates or clones?

This suggests there might be a couple of clones/duplicates in here: 307 individuals but only 304 multilocus genotypes (MLG). This is possible because of low genetic variability…or simply because someone got double counted. Explore this more. The three apparent clones have adjacent specimen IDs (e.g. BAFF429 and BAFF430), so they could result from cross-contamination or simply from being caught in the same population.

## #############################
## # Number of Individuals:  307 
## # Number of MLG:  304 
## #############################
## [1] 304
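To chase down which specimens share a genotype, the idea is simply to group individuals by their full multilocus genotype and list any group with more than one member. The sketch below shows that logic in Python. It assumes exact matching, treats a missing genotype as its own state, and ignores allele order within a locus; these are choices made for this sketch and not necessarily how the MLG count above was computed. Specimen IDs and alleles are hypothetical.

```python
from collections import defaultdict

def multilocus_genotypes(genotypes):
    """Group specimen IDs that share an identical multilocus genotype."""
    groups = defaultdict(list)
    for sid, row in genotypes.items():
        # Sort alleles within each locus so (100, 104) and (104, 100) match.
        key = tuple(tuple(sorted(g)) if g is not None else None for g in row)
        groups[key].append(sid)
    return list(groups.values())

# Toy data: S1 and S2 carry the same genotype, S3 is distinct.
genotypes = {
    "S1": [(100, 104), (200, 200)],
    "S2": [(104, 100), (200, 200)],
    "S3": [(100, 100), (200, 204)],
}
groups = multilocus_genotypes(genotypes)
n_mlg = len(groups)
duplicates = [g for g in groups if len(g) > 1]
print("Individuals:", len(genotypes), "MLG:", n_mlg)
print("Shared genotypes:", duplicates)
```

Listing the duplicate groups alongside their collection cluster and date would help tell double-counted specimens apart from genuinely identical individuals.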

Polymorphic

Just a little sanity check that all the loci are polymorphic (they are).

##    Mode    TRUE 
## logical      13

Summary Statistics

Table of various summary stats

Observed v. Expected Heterozygosity

Only displaying populations with 5 or more specimens.

Figure 2. Observed (blue) versus expected (grey) heterozygosity across the regions. Numbers above bars are the sample size for each region.

Same data as Figure 2, but in table format and including ALL clusters (not just those with >=5 specimens), for investigating to your heart’s content. Added some math as a little helper.
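As a reminder of what the two quantities are, here is a minimal per-locus sketch in Python. Observed heterozygosity (Ho) is the fraction of typed individuals that are heterozygous; expected heterozygosity (He) is computed here as 1 minus the sum of squared allele frequencies, without a small-sample correction. The software used for the figure may apply a correction or average over loci differently, so treat this as an illustration only. The function name and genotypes are hypothetical.

```python
from collections import Counter

def ho_he(locus_genotypes):
    """locus_genotypes: list of (allele, allele) tuples, or None if missing."""
    typed = [g for g in locus_genotypes if g is not None]
    n = len(typed)
    if n == 0:
        return None, None  # nothing typed at this locus
    # Observed: share of typed individuals carrying two different alleles.
    ho = sum(a != b for a, b in typed) / n
    # Expected: 1 - sum(p_i^2), with frequencies taken over the 2n allele copies.
    counts = Counter(allele for g in typed for allele in g)
    he = 1.0 - sum((c / (2 * n)) ** 2 for c in counts.values())
    return ho, he

# Toy locus: two homozygotes, two heterozygotes, one missing genotype.
locus = [(1, 1), (1, 2), (2, 2), (1, 2), None]
ho, he = ho_he(locus)
print(ho, he)
```

With very small samples He is noisy and biased downward, which is one reason the figure is restricted to populations with 5 or more specimens.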

F-Statistics

This is just kinda trash for now. We need to determine how we really wanna cluster things. The current clustering produces way too many clusters to be intelligible, so I haven’t really worked on getting it clean.

Below are the results for only the populations with >=5 specimens.